Received: 5 December 2022 


Revised: 17 August 2023 




Accepted: 11 September 2023 


DOI: 10.1002/aaai.12128


ARTICLE 




Video Turing Test: A first step towards human-level AI 


Minsu Lee | Yu-Jung Heo | Seongho Choi | WooSuk Choi | Byoung-Tak Zhang


1 Artificial Intelligence Institute, Seoul National University, Seoul, Republic of Korea

2 Seoul National University, Seoul, Republic of Korea

3 KT, Seoul, Republic of Korea


Correspondence 

Byoung-Tak Zhang, Artificial Intelligence 
Institute, Seoul National University, 
Seoul, Republic of Korea. 

Email: btzhang@bi.snu.ac.kr 


Yu-Jung Heo: This work was carried out 
at Seoul National University. 


Funding information 

Institute of Information & 
Communications Technology Planning & 
Evaluation, Grant/Award Numbers: 
2017-0-01772-VTT, 2022-0-00951-LBA, 
2022-0-00953-PICA; National Research 
Foundation of Korea, Grant/Award 
Numbers: 2021R1A2C10-10970, 
RS-2023-00274280 


Abstract

The development of artificial intelligence (AI) agents that can understand video content at a human level and converse with humans on that basis is a promising and widely anticipated application. However, this is a challenging task that requires the holistic integration of multimodal information with temporal dependencies and reasoning, as well as social and physical commonsense. In addition, the development of appropriate systematic evaluation methods is essential. In this context, we introduce the Video Turing Test (VTT), a blind test used to evaluate human-likeness in terms of video comprehension ability. Moreover, we propose Vincent as a video understanding AI. We explain the configuration of VTT, the architecture of Vincent to prepare for VTT, and the proposed evaluation methods for video comprehension. We also estimate the current intelligence level of AI based on our results and discuss future research directions.


INTRODUCTION


Development of video understanding AI


Video content understanding is an essential component of artificial intelligence (AI) as it enables machines to comprehend and interpret audio-visual information in a way that is similar to humans. Videos are a rich source of information that portray real-world environments and the lives of people therein, providing an easy-to-understand medium for conveying stories and messages.


Minsu Lee and Yu-Jung Heo contributed equally to this study. 


Figure 1 depicts the various semantic components that 
are involved in understanding video content, such as 
scenes, environments, objects, actions, events, attributes, 
and concepts (Diba et al. 2020). Therefore, the construc- 
tion of AI that can comprehend video contents requires 
technical advancements that enable the recognition and 
processing of human language, visual and auditory per- 
ception, and spatiotemporal information in an integrated 
way. By developing video understanding AI, machines can 
interpret audio-visual information like humans, improv- 
ing their ability to interact with and understand the world. 


This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.
© 2023 The Authors. AI Magazine published by John Wiley & Sons Ltd on behalf of Association for the Advancement of Artificial Intelligence.


AI Magazine. 2023;44:537-554.


wileyonlinelibrary.com/journal/aaai


AI MAGAZINE 


FIGURE 1 Illustration of a video-understanding AI. Video understanding is a comprehensive task that encompasses the recognition of multiple semantic components. [Figure labels: person identification and tracking; person/object detection; sound recognition; background recognition; action recognition; situation change detection; intention recognition; emotion recognition; knowledge extraction from conversations; relation inference among events; semantic vocabulary mapping; commonsense/background knowledge.]


Moreover, such AI offers numerous practical applications, 
including those in autonomous driving, surveillance, and 
healthcare. 

The research topics for developing video understand- 
ing AI include text-to-video retrieval, video captioning, and 
video question answering, and the creation of benchmark 
datasets that are available to the public (Lei et al. 2018; 
Alamri et al. 2019; Miech et al. 2019; Diba et al. 2020; Choi 
et al. 2021). These tasks require the AI to recognize and 
comprehend the various aforementioned semantic compo- 
nents in a given video. To evaluate the performance of the 
video understanding AI more efficiently, various metrics 
are used. For multiple-choice question answering, accu- 
racy is the commonly used metric. For language generation 
tasks such as video captioning, metrics like BLEU (Pap- 
ineni et al. 2002), METEOR (Banerjee and Lavie 2005), and 
CIDEr (Vedantam, Lawrence Zitnick, and Parikh 2015) are 
used. For retrieval tasks, precision and recall are used as 
evaluation measures. 
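
As a minimal sketch of the simpler metrics named above, the snippet below computes accuracy for multiple-choice answers and precision and recall for a retrieval result. The function names and toy inputs are illustrative assumptions and are not taken from any benchmark; BLEU, METEOR, and CIDEr are omitted because they depend on reference implementations.

```python
def accuracy(predicted, gold):
    """Fraction of multiple-choice answers that match the ground truth."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one retrieval query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: chosen option indices and retrieved clip identifiers.
acc = accuracy([2, 0, 3, 1], [2, 0, 1, 1])
prec, rec = precision_recall(
    ["clip_a", "clip_b", "clip_c", "clip_d"],
    ["clip_b", "clip_d", "clip_z"],
)
print(acc, prec, rec)
```

Here three of the four answers match, two of the four retrieved clips are relevant, and two of the three relevant clips are retrieved.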

However, these tasks predominantly focus on specific 
aspects of video understanding, such as encoding cross- 
modality in joint embedding spaces or answering a ques- 
tion about a simple event that occurred in a short video 
clip. Furthermore, evaluations are typically performed by 
comparing the trained models to human performance indi- 
rectly. Therefore, there is a need to design a well-structured 
task and a more direct evaluation framework for com- 
prehensive video understanding. To address this need, we 
introduce the Video Turing Test (VTT). 


Turing Test 


First, we briefly explain the original Turing Test and 
its variants which were recently proposed to extend the 
original Turing Test. 




In 1950, Alan Turing proposed a variation of the imi- 
tation game, now known as the Turing Test, where a 
computational machine and a human player communi- 
cate with an interrogator using text while being invisible 
to each other (Turing 1950). The objective of the machine 
is to ensure that the interrogator cannot distinguish it from 
the human player. It is widely accepted that machine intel- 
ligence can be considered as human-like once it achieves 
this objective. 

Although the Turing Test was proposed as a novel estimation method, it has been criticized for the following limitations. First, a machine can pass the test based on “trickery or guile” (Shieber 1994; Boden 2006). Second, the Turing Test focuses primarily on linguistic capability without considering additional components of human intelligence (e.g., visual understanding). Finally, the passing criterion for the Turing Test, that is, the interrogator’s judgment, is quite subjective.

Several new suites of tests have been proposed by 
extending the original Turing Test to compensate for the 
aforementioned limitations. The tests adopt diverse tasks 
to evaluate AI from multiple perspectives, such as visual 
understanding, emotional understanding, and so forth. 
McKinstry (1997) and Geman et al. (2015) considered only 
Boolean-type questions about a series of facts and an 
image scene, respectively, to provide a more quantitative 
measure of intelligence that is not affected by the use of 
trickery or guile. Olague et al. (2021) suggested a purely 
visual processing-based Turing Test that avoids reliance on 
conversational ability and imitates the emotional interpre- 
tation of humans. Adams, Banavar, and Campbell (2016) 
presented I-athlon as a multidimensional Turing Test. The 
authors assessed the conversational abilities of AIs as well 
as a wide variety of intelligent behaviors, such as video 
understanding. On the other hand, Adiwardana et al. 
(2020) proposed the Sensibleness and Specificity Average 




TABLE 1 Variants of the Turing Test and their characteristics. H and A are abbreviations for human evaluation and automatic evaluation, respectively.

Variant of Turing Test | Contents | Judge | Characteristics and examples
Minimum Intelligent Signal Test (McKinstry 1997) | Text | H | Answering Boolean questions about specific knowledge or fundamental facts related to the meaning of various concepts. [Example] Question: Is Earth a planet? Answer: Yes
Winograd Schema Challenge (Levesque, Ernest, and Leora 2012) | Text | A | Answering multiple-choice questions derived from Winograd schemas. [Example] Question: The trophy would not fit in the brown suitcase because it was too big. What was too big? Answer: The trophy
Visual Turing Test (Geman et al. 2015) | Image | A | Asking Boolean questions about ordinary images of natural scenes. [Example] Question: Is there a person in the blue region? Answer: Yes
Social-Emotional Turing Challenge (Jarrold and Yeh 2016) | Text | H | Predicting an emotion label for the target scenario and generating an appropriate explanation. [Example] Target scenario: Tracy wants a banana. Mommy gives Tracy an apple. How will Tracy feel? Emotion label: Sad. Appraisals: She did not get what she wanted.
Visual Question Answering (Zitnick et al. 2016) | Image | A | Answering visual questions for real-world images and abstract scenes or cartoon images. [Example] Question: How many slices of pizza are there? Answer: 6
AI Science Challenge (Schoenick et al. 2017) | Text | A | Answering standard 8th-grade multiple-choice science questions. [Example] Question: How many chromosomes does the human body cell contain? Answer candidates: (A) 23 (B) 32 (C) 46 (D) 64. Answer: (C) 46
Multiturn Open-Domain Chatbot (Adiwardana et al. 2020) | Text | H | Generating human-like multiturn conversation in an open-domain setting. [Example] Human: I like how cute cows are. Meena: Ah, interesting. You like how they are cute, right? Human: Yeah, and they’re really smart. Meena: I heard they go to college.
Visual Turing Test with Kuleshov Effect (Olague et al. 2021) | Image | A | Interpreting visual stimuli by emulating the Kuleshov effect. [Example] (image stimuli) Fear, awe
Ours: Video Turing Test | Video | H, A | Answering emotion, commonsense, background information, conversation, and intention-related questions about given drama video scenes. [Example] Question: What did Dokyung do to share an umbrella with Haeyoung? Answer: Dokyung gave his umbrella to an older man.


metric, which captures two aspects (i.e., making sense and being specific) of human-likeness in conversation.

As indicated in Table 1, the tests have different characteristics, and most exploit indirect evaluation using automatic evaluation metrics. Some prior studies have advocated for direct human evaluation, but they typically concentrate on specific abilities relating to language modality. However, the proposed VTT focuses on the diverse capabilities that




are necessary for comprehending video content and pro- 
vides both indirect and direct comparisons of an AI agent 
with a human in terms of intelligence. Furthermore, our 
test measures the level of AI technology progress in detail 
according to the age group or skill level of the compared 
human player. This is the most intuitive measure for the 
human-likeness of AI, and not an abstract computational 
measure. 


Design principles of Video Turing Test 


In this work, we employ a single-turn question-answering task for the proposed VTT. The objective of our AI agent is to answer questions requiring diverse capabilities for video comprehension. To compensate for the previously mentioned vulnerabilities of the Turing Test, we consider several complementary approaches. Success via trickery or guile can be largely avoided by asking specific questions to evaluate the understanding of watched videos. To this end, we include multiple-choice questions as well as open-ended questions that require answers in the form of complete sentences in the proposed VTT. Multiple-choice questions are preferred over Boolean-type questions because they reduce the probability of guessing correct answers accidentally. Open-ended questions are included to emulate humans, who usually answer questions in the form of complete sentences. By utilizing remarkably developed language generation technology, a sentence-type answer can be generated instead of a short-answer type and compared with those of humans.
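
The effect of the question format on lucky guessing can be made concrete with a small calculation. The sketch below assumes uniform random guessing and a hypothetical five-option multiple-choice format (the number of options is an assumption for illustration); the function name is likewise illustrative.

```python
from math import comb


def chance_of_at_least(n_questions: int, n_correct: int, n_choices: int) -> float:
    """Probability of getting at least n_correct of n_questions right
    when every answer is a uniform random guess among n_choices options."""
    p = 1.0 / n_choices
    return sum(
        comb(n_questions, k) * p**k * (1 - p) ** (n_questions - k)
        for k in range(n_correct, n_questions + 1)
    )


# Guessing all of three questions correctly:
boolean_format = chance_of_at_least(3, 3, 2)    # Boolean questions
five_way_format = chance_of_at_least(3, 3, 5)   # hypothetical 5-option questions
print(boolean_format, five_way_format)
```

Under these assumptions, guessing three Boolean questions correctly happens one time in eight, whereas doing so with five options per question happens fewer than one time in a hundred.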

Moreover, to avoid excessive focus on linguistic capacity, 
we consider various questions including other components 
of human intelligence that are required for video com- 
prehension, such as visual understanding, intention or 
context comprehension, and commonsense. On the other 
hand, passing the Turing Test is largely determined by 
the characteristics of the person that the AI agent is com- 
pared to or the subjectivity of the juror. This limitation is 
ameliorated by including multiple human players in VTT 
and using multiple interrogators for evaluation. Further- 
more, we quantitatively evaluate the level of intelligence 
and human-likeness of the AI agent based on its accuracies 
with respect to questions of varying difficulty, comparisons 
of its answers with those of people of various ages, and 
evaluations based on the cognitive processes of humans on 
story elements. 

The QA task is a simple but effective method for assessing how well an ability has been achieved by reviewing the answers. In particular, the QA task becomes more powerful when creating and asking “good questions with intentions” that can identify the items to be assessed. We systematically prepared good questions to evaluate the level of video understanding intelligence and used VideoQA as a “tool” to evaluate the video understanding ability in VTT. Moreover, VTT is designed with the objective of evaluating the video understanding ability compared to the human level, rather than simply measuring the percentage of correct answers.

VTT offers several advantages over the standard VideoQA task, as follows: VTT (1) utilizes questions with intentions based on story element analysis for video understanding, (2) conducts a direct comparison with the answers of humans of various ages, (3) has jurors judge whether the answers are human-like, and (4) analyzes multifaceted video cognitive ability through various cognitive module-based analyses.

In the following sections, we introduce VTT, which is 
a blind test to evaluate human-likeness in terms of video 
comprehension ability, the configuration of VTT, and the 
proposed evaluation methods for video comprehension. 
We explain the architecture of Vincent, which is the pro- 
posed video-understanding AI to prepare for VTT. We also 
discuss the current intelligence level of AI based on our 
results. 


VIDEO TURING TEST 


VTT was performed on November 26, 2021, at Seoul National University. This test was a public demonstration of our video intelligence platform (VIP) and video intelligence evaluation methods. VTT comprised a blind test comparing VIP to human players of different ages in terms of the video comprehension ability.

Prior to the aforementioned VTT event, we organized 
VTT workshops at several international conferences to 
share the perspectives of experts from various fields, 
including vision, language processing, multimedia, and 
speech recognition, on data-driven video comprehension 
(VTT Workshop on ICCV 2019; VTT Workshop on ECCV 
2020). We also hosted DramaQA challenges at the Korea 
Software Congress (KSC) 2019 and ECCV 2020 to assess 
state-of-the-art methodologies and encourage significant 
progress in this field (Choi et al. 2021; DramaQA Chal- 
lenge on KSC 2019; DramaQA Challenge on ECCV 2020). 
Moreover, we reviewed and improved VTT procedures and 
evaluation measures using two small-scale mock VTTs 
(Heo et al. 2021; Shin et al. 2021; Heo et al. 2019). 


VTT procedure 


The VTT procedure is illustrated in Figure 2. We used a blind quiz show format with an independent booth for each player. All participants were isolated from one another and were not visible to the jurors.

FIGURE 2 Overall procedure of the Video Turing Test (VTT). [Figure content: VideoQA: In each round, three pairs of video clips and questions are provided. (1) All players and jurors watch a video clip. (2) A question about the video clip is provided. (3) The hidden players submit their answers to the question. (4) All submitted answers are displayed. Voting at the end of each round: (5) After the completion of all VideoQA in a round, the jurors vote for the player expected to be an AI agent. (6) A text message-based voting system aggregates the votes. (7) The results of the votes from the juries are released. Result (pass or fail): (8) The player who received the most votes is revealed first, and then all hidden players are revealed. (9) If the AI agent has not received the most votes, it passes the VTT in this round. (10) Player positions are randomly shuffled.]

In each round, three pairs of video clips and questions were provided. All players and jurors watched a video clip, and then a question about it was posed to both the players and jurors. Each player submitted their answer to the question secretly. The submitted answers were presented to the jurors.

After repeating this process for all video clips, the jurors voted for the player that they expected to be an AI agent.




A text message-based voting system was used to aggregate 
the vote counts, and the results were presented. 

Finally, the player who received the most votes was revealed first, and subsequently, all other players were revealed. If a human player received at least as many votes as the AI agent, the AI agent was deemed to have passed VTT for the subject of that round. The positions of all players were randomly shuffled before the start of the next round.
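
The voting and pass rule described above can be sketched as follows. This is an illustrative reading of the text, not the authors' voting system: the player labels and vote counts are hypothetical, and ties are resolved in favor of the AI agent because the text states that the agent passes when a human player receives a higher or equal number of votes.

```python
from collections import Counter


def tally_votes(votes):
    """Count juror votes; each vote names the player a juror believes is the AI."""
    return Counter(votes)


def vote_share(counts, player):
    """Fraction of all votes cast for the given player."""
    total = sum(counts.values())
    return counts[player] / total if total else 0.0


def ai_passes_round(counts, ai_player, human_players):
    """The AI passes if at least one human received as many or more votes."""
    return any(counts[h] >= counts[ai_player] for h in human_players)


# Hypothetical round with five hidden players (P3 is the AI) and 34 jurors.
votes = ["P1"] * 5 + ["P2"] * 12 + ["P3"] * 9 + ["P4"] * 2 + ["P5"] * 6
counts = tally_votes(votes)
humans = ["P1", "P2", "P4", "P5"]
print(vote_share(counts, "P3"), ai_passes_round(counts, "P3", humans))
```

In this made-up round the AI agent passes because player P2 attracted more votes than the agent did.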


VTT rounds 


VTT consisted of five rounds corresponding to different 
subjects of cognitive ability, and a few pairs of video clips 
and questions were presented in each round to evalu- 
ate the players’ capabilities on the corresponding subject. 
We organized five rounds with the following evaluation 
objectives: 


Emotion recognition (Emotion): This round assessed the capability to understand human emotion based on Paul Ekman’s six basic emotions (joy, sadness, anger, fear, disgust, and surprise) (Ekman 1989) as well as more complex emotions. Recently, several studies have been conducted on human emotion recognition based on images or textual scenarios (Jain et al. 2018; Jarrold and Yeh 2016). The results of this round could be used as a reference to estimate the AI agent's current level of understanding of the complex emotional spectrum of humans.

Knowledge-based reasoning (Commonsense): 
This round required the ability to infer causes or 
effects that were not directly visible in a video. The 
capability to understand behaviors and situations 
based on tacit knowledge, such as common- 
sense, natural causality, synonym processing, and 
universal moral concepts, was evaluated based 
on the results of this round. Although several 
commonsense-based projects, such as Cyc (Lenat 
and Guha 1989) and ConceptNet (Speer, Chin, and 
Havasi 2017), have been developed over the past 
30 years, the construction of commonsense-based 
reasoning systems remains an unsolved problem 
in AGI. The results of this round are expected 
to enable commonsense-based reasoning using 
video content based on deep neural networks and 
ConceptNet. 

Recalling background information (Back- 
ground): In this round, the ability to recall 
background information displayed on the screen, 
such as clothes, shoes, and passing cars, was 


evaluated. As background information is usu- 
ally not important for understanding scenes or 
stories, humans often do not retain this informa- 
tion. In contrast, based on recent developments 
in image comprehension technologies, the AI 
agent is expected to exhibit decent performance 
in this round. This affects the evaluation of 
human-likeness by the jurors in an interesting 
manner. 

Conversation context comprehension (Conver- 
sation): In this round, comprehension of the 
intention and meaning of speech based on prior 
experience or knowledge was evaluated. This round 
was essential for the assessment of the level of 
video-understanding intelligence because under- 
standing conversational context is a key element of 
story comprehension. 

Action-intention reasoning (Intention): In this 
round, the ability to guess the intent of an action not 
directly shown in the video was evaluated. Video 
content does not usually depict the thoughts of 
characters explicitly. Hence, viewers must guess the 
intentions of the characters based on the flow of 
events and the role-playing of the actors. Humans usually rely on direct and indirect experiences to infer the intention behind an action; thus, this round represented a very challenging task for the AI.


Video clips and questions 


Drama was deemed the most suitable genre for VTT as it involves stories about people's daily lives. Furthermore, the components of drama, for example, real-world scenes, dialogs, sound effects, and textual information, are effective for the development and evaluation of human-friendly AI as they contain adequate training resources for AIs to learn to see, hear, speak, and react like humans (Tapaswi et al. 2016). In the proposed VTT, we used shot and scene clips that were sampled from the popular Korean drama “Another Miss Oh,” which comprises 18 episodes with a total playtime of 20.5 h. The concluding part of the series (Episodes 16-18) was used in VTT. We obtained official permission from the relevant content provider to use this drama for our research.

We requested cognitive scientists to compose 30 new 
questions corresponding to the five VTT rounds to ensure 
cognitively detailed evaluation and reduce similarity with 
the training data for VIP. They considered (i) the distribu- 
tion of literary elements in a story, such as the characters, 
settings, and plots; (ii) question types (i.e., 15 multiple- 
choice questions vs. 15 open-ended questions); and (iii) 




the video length related to each question (i.e., 15 questions 
each corresponding to the shot and scene clips). 

Considering the time limit of the live TV show format, 
only half of the questions were used for VTT, including 12 
open-ended questions and three difficult multiple-choice 
questions. The answers to the remaining questions were 
used during the fine-grained evaluation of the AI agent 
compared to human players. 


Human players and AI agent 


Human intelligence is not uniform—cognitive aspects dif- 
fer considerably by age group according to the correspond- 
ing developmental stages and individual characteristics. 
Each person exhibits different strengths and weaknesses 
in cognitive abilities. For example, children generally excel 
at recalling but struggle with reasoning, whereas certain 
adults may focus on the main character of a video clip and 
ignore their surroundings. Therefore, judging the human- 
likeness of an AI agent may be influenced by the diverse 
characteristics of the jurors and human players, which 
makes the generalization of test results difficult. 

To address this issue, a group of multiple human play- 
ers rather than a single human player was considered in 
VTT, enabling the derivation of more concrete and defini- 
tive results. We considered four human players belonging 
to different cognitive developmental stages in terms of age 
ranges. The four players were aged 6, 9, 12, and 18 years. 
Vincent, which is an instantiation of the proposed VIP 
for VideoQA that can answer both multiple-choice and 
open-ended VideoQA, was selected as the AI agent. 

This study was reviewed by and received ethical 
clearance from the Dongguk University Institutional 
Research Board (DUIRB-202104_21). Informed consent 
was obtained from all participants or their parents, in the 
case of underage subjects, after all procedures and their 
rights had been fully explained. 


Jury 


We invited 34 jurors of both sexes aged between 20 and 70. Sixteen jurors were experts in AI, especially in language, speech, and vision, and the others were not AI experts. The jurors were provided with a judgment sheet to record their predictions corresponding to each QA. After each round, the jurors were instructed to guess the identity of the AI using the judgment sheet as a cue.


Evaluation metrics used in VTT 


The aim of the proposed VTT was to evaluate the capability of an AI agent to understand video content and to provide a direction for future research based on the current strengths and weaknesses of the AI agent by analyzing the results multidimensionally. To this end, a carefully selected set of questions was required to evaluate each player’s video comprehension ability systematically. In addition, the identification of suitable evaluation metrics for video comprehension intelligence with appropriate, measurable, easy-to-understand, and comparable characteristics was essential. The current status of techniques was adjudged in terms of these metrics, and the performance of the AI agent as well as the reliability and usefulness of VTT were improved on this basis. In this study, the following evaluation criteria were used to analyze VideoQA and VTT results:


Percentage of votes cast: This metric was used to estimate the human-likeness of the AI agent. After the completion of VideoQA in each round, the jurors voted for the player that they expected to be an AI agent. Therefore, the player who received the most votes was considered to be the most AI-like, that is, nonhuman-like, player. Thus, if the AI agent did not receive the most votes, it was considered to exhibit human-like video comprehension ability.

Accuracy: This metric measured the correctness or specificity of the submitted answers and could be used as a simple measure of the level of video comprehension intelligence of each player. The accuracy of each answer to the multiple-choice questions was determined simply based on the correctness of the answer. However, answers to the open-ended questions required careful evaluation to overcome the previously discussed limitations of the Turing Test. In particular, this evaluation method is needed to measure the correctness, specificity, and contextual meaning of each answer, as well as its level of ambiguity. Thus, analytic rubrics, a method for task-specific performance assessment based on detailed predefined assessment criteria and levels of achievement, were adopted in the proposed VTT. The evaluators, who composed the questions for VTT, defined beforehand the key story elements, essential terms, and concepts that ought to be included in each answer, and assigned different weights to different criteria based on the significance of each question. The overall level of achievement for each question was calculated as the weighted sum of the obtained scores for each criterion.
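
A weighted rubric score of the kind described for open-ended answers might be computed as in the sketch below. The criterion names, weights, and per-criterion scores are hypothetical, and dividing by the total weight so that the result lies between 0 and 1 is an added assumption; the text only specifies a weighted sum.

```python
def rubric_score(weights, achieved):
    """Weighted sum of per-criterion scores, normalised by the total weight.

    weights:  criterion name -> importance weight (positive number)
    achieved: criterion name -> score in [0, 1]; missing criteria count as 0
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(w * achieved.get(name, 0.0) for name, w in weights.items())
    return weighted / total_weight


# Hypothetical rubric for one open-ended question.
weights = {"character": 1, "action": 2, "recipient": 1}
# Hypothetical answer that names the character and action but not the recipient.
achieved = {"character": 1.0, "action": 1.0, "recipient": 0.0}
print(rubric_score(weights, achieved))
```

An answer that covers every criterion scores 1.0 under this normalisation, and one that covers none scores 0.0.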

FIGURE 3 Overview of video intelligence platform (VIP) for Video Turing Test (VTT). [Figure labels: Speech Recognition Engine (input and output: audio signals; speech-to-text (STT); text-to-speech). Dialog Management Engine (dialog state tracking; syntax analysis; semantic analysis; dialog context modeling; intention classification; conversation generation). Perception and Cognition Engine (action recognition; emotion recognition; context understanding; conversation understanding; person/object recognition; background recognition; sound recognition; event detection; knowledge extraction; scene detection; causality understanding; character identification; character tracking; intention recognition; relational inference; content-based knowledge; world knowledge; commonsense knowledge). Video Data Management Engine Based on Crowd-Sourcing System (semi-automatic video data, e.g., QA, captioning, metadata; video data processing; video data editing system; video metadata generation; video metadata error correction). Video Question Answering Engine (inputs: video, question, story narrative; hierarchical memory-based QA; attention-based QA; knowledge-based QA; question difficulty estimation; answer selection; open-ended VideoQA; multiple-choice VideoQA).]

Accuracy by question difficulty: If the questions for VTT are organized according to the characteristics or difficulty of each question, the accuracy of each group of questions can be leveraged to


estimate the strengths and weaknesses of players based on the characteristics of each group. In this study, we grouped the questions into four difficulty levels based on the memory capacity (shot or scene) and logical complexity that were required for problem-solving (Heo et al. 2019; Choi et al. 2021).

CogME analysis: Cognitive Modules for Evaluation (CogME) (Shin et al. 2021) represents the degree of achievement in terms of the literary elements of a story. CogME consists of three cognitive modules (Target, Content, and Thinking) based on different human cognitive processes and story elements. Each module corresponds to particular literary elements of a story, and the elements can be assigned to each question to specify the information, knowledge, or thinking capacity that is required to answer the question. We used the scoring criteria suggested by Shin et al. (2021) for evaluation. For the multiple-choice QA, the accuracy was defined as the binary score representing the consonance between participants’ answers and ground-truth answers. For example, an accuracy of 1 was assigned to the story elements that were associated with a question when the predicted answer was correct; otherwise, it was set to 0. For the open-ended QA, the scoring criteria were extended by adopting analytic rubric-based accuracy. The partial scores corresponding to the story elements associated with each question were obtained based on the inclusion of the story elements in the generated answer. The aforementioned analysis enabled the identification of the strengths and weaknesses of each player's video comprehension intelligence in great detail in terms of cognitive modules.


VIDEO INTELLIGENCE PLATFORM


To develop an AI agent for video understanding, we constructed VIP, an integrated system for understanding video, by compiling various related studies. VIP is designed to include a wide range of engines essential for building video-understanding AI. Figure 3 depicts the overall architecture of VIP, including the following engines (Bebensee and Zhang 2021; Kim et al. 2020; Kim et al. 2019; Yu, Kim, and Kim 2018; Kim et al. 2018; Na et al. 2017; Kim et al. 2021; Ryu et al. 2021; Yu et al. 2020; Lee, Heo, and Zhang 2018).

The speech recognition engine enables human inter- 
action by receiving voice signals and responding with a 
human-like voice. It forwards the conversation input to 
the dialog management engine and receives responses. 
The dialog management engine uses contextual analysis 
based on dialog history to maintain conversational flu- 
idity, particularly when addressing queries related to the 
video content, which is the primary functionality of the 
VIP system. 

The video question-answering engine sources compre- 
hensive information regarding the video from the percep- 
tion and cognition engine. The perception and cognition 
engine identifies various visual and textual elements, such 
as action, object, and person, along with background, con- 
text, and event, to create a video scene graph that facilitates 
a comprehensive understanding of the video. The engine 
also uses content-based knowledge, world knowledge, and 
commonsense knowledge to construct the video scene 
graph. The dataset required for training and evaluating 
the video question-answering engine and perception and 
cognition engine is annotated from the video data man- 
agement engine. Additionally, each engine is pretrained 
using a large-scale collected dataset, enabling the platform 
to function swiftly on any given video. 

VIP is a vast and complex system, and VTT is performed 
based on the VideoQA task. Therefore, the following sec- 
tions explain VIP architecture, focusing primarily on the 
video question-answering engine. 

First, we introduce the VideoQA task used in general 
machine learning. In multiple-choice VideoQA, videos, 
questions, and several candidate answers are provided as 
inputs, and the model is required to identify the correct 
answer. Most works on VideoQA encode video features 
and textual features of QAs and infer correct answers 
using a deep spatio-temporal reasoning layer. In open- 
ended VideoQA, only the videos and questions are encoded 
and a language decoder is attached to generate the correct 
answer. In contrast to previous studies, the AI agent used in 
VTT was designed to consider hierarchical difficulty levels 
of questions using various QA modules. 

Vincent is an instantiation of the video question- 
answering engine of the proposed VIP, which was used 
as the AI agent in VTT. It comprises five modules each 
for multiple-choice QA and open-ended QA. As illustrated 
in Figure 3, a question difficulty estimation module, three 
VideoQA modules, and one answer selection module are 
included. Given a video and a corresponding question, 
Vincent infers the difficulty of the question using the 
question difficulty estimation module and uses the three 
VideoQA modules to infer an answer. The final answer is
selected using the answer selection module by considering
the inferred difficulty of the question and the inference
results of the three VideoQA modules simultaneously.
Each VideoQA module exhibits individual characteristics, 
and Vincent uses them to derive optimal answers to vari- 
ous difficulty levels of questions. We briefly introduce each 
module’s architecture and the dataset used to train it. 
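The control flow described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `VideoQuestion` fields, the callable interfaces, and the toy estimator, modules, and selector are all hypothetical placeholders standing in for trained networks.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class VideoQuestion:
    """One VTT-style query (field names are illustrative assumptions)."""
    video_id: str
    subtitles: str
    question: str


# Hypothetical interfaces: the source does not specify module signatures.
DifficultyEstimator = Callable[[VideoQuestion], int]   # returns a level from 1 to 4
QAModule = Callable[[VideoQuestion], str]              # returns one candidate answer
AnswerSelector = Callable[[VideoQuestion, int, Dict[str, str]], str]


def answer_question(
    item: VideoQuestion,
    estimate_difficulty: DifficultyEstimator,
    qa_modules: Dict[str, QAModule],
    select_answer: AnswerSelector,
) -> str:
    """Estimate difficulty, query every QA module, then select one answer."""
    difficulty = estimate_difficulty(item)
    candidates = {name: module(item) for name, module in qa_modules.items()}
    return select_answer(item, difficulty, candidates)


# Toy stand-ins so that the sketch runs end to end.
def toy_estimator(item: VideoQuestion) -> int:
    return 4 if item.question.lower().startswith("why") else 1


toy_modules: Dict[str, QAModule] = {
    "hierarchical_memory": lambda item: "answer from memory module",
    "attention": lambda item: "answer from attention module",
    "knowledge": lambda item: "answer from knowledge module",
}


def toy_selector(item: VideoQuestion, difficulty: int, candidates: Dict[str, str]) -> str:
    # Arbitrary rule for illustration only: prefer the knowledge module on hard questions.
    preferred = "knowledge" if difficulty >= 3 else "attention"
    return candidates[preferred]


hard_item = VideoQuestion("clip_001", "subtitle text", "Why did the boy stand up?")
easy_item = VideoQuestion("clip_002", "subtitle text", "What is passing by outside the window?")
hard_answer = answer_question(hard_item, toy_estimator, toy_modules, toy_selector)
easy_answer = answer_question(easy_item, toy_estimator, toy_modules, toy_selector)
```

Passing the modules in as callables keeps the orchestration independent of how each module is built, which mirrors the separation between the difficulty estimator, the three QA modules, and the answer selector described in the text.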


Question difficulty estimation 


The question difficulty estimation module accepts video 
frames as video inputs and subtitles and a question as tex- 
tual inputs. The module’s architecture is identical for both 
multiple-choice and open-ended question types since can- 
didate answers are not used. It encodes video inputs using 
a visual backbone (He et al. 2016) and textual inputs using 
a text encoder (Liu et al. 2019). The sequence of visual and 
textual inputs is then transmitted into the cross-attention 
encoder (Vaswani et al. 2017) to determine cross-modal 
temporal associations. Finally, the estimator classifies the 
difficulty of the given question based on the encoded uni- 
modal (i.e., visual and textual features) and cross-modal 
representations. The estimated question difficulty is subsequently
used in the answer selection module. Each question in the VTT
was assigned one of four difficulty levels.


Multiple-choice VideoQA 


To perform multiple-choice VideoQA, we developed three modules:
a hierarchical memory-based QA module, an attention-based
QA module, and a knowledge-based QA module.

The hierarchical memory-based QA module accepts 
video frames as video inputs and subtitles and a question 
as textual inputs. It encodes video inputs using a visual 
backbone (He et al. 2016) and a recurrent memory network 
(Hochreiter and Schmidhuber 1997), and textual inputs 
using pretrained word embedding (Pennington, Socher, 
and Manning 2014) and a recurrent memory network. 

Typical VideoQA frameworks use these low-level 
encoded representations directly. However, this module 
reconstructs character-guided high-level story repre- 
sentations for each sequence based on the low-level 
story representations by using an attention mechanism 
and the characters appearing in the question and each 
answer candidate as queries. Finally, it infers the correct 
answer using multilevel context matching based on the 
aforementioned story representations and QA. 

The attention-based QA module accepts the same
inputs as the hierarchical memory-based QA module, but
employs a different encoding strategy. It uses a visual
transformer encoder (Vaswani et al. 2017) for visual inputs and
a language transformer encoder (Devlin et al. 2019) for
textual inputs. Given the encoded sequences of both
modalities, it uses a cross-modal transformer to calculate
the final probability of each candidate answer. Within the
cross-modal transformer, it adjusts the attention scores by
using character identity as an attention prior.
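The text does not state how the character-identity prior enters the attention computation. One common option, shown here purely as an assumption, is to add the prior as a bias to the scaled dot-product logits before the softmax. The function name `attention_with_prior`, the `weight` parameter, and the toy vectors are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_prior(
    query: List[float],
    keys: List[List[float]],
    prior: List[float],
    weight: float = 1.0,
) -> List[float]:
    """Scaled dot-product attention weights with an additive prior per key.

    Assumption: the prior is added to the logits; the source does not
    specify the exact mechanism.
    """
    if len(keys) != len(prior):
        raise ValueError("one prior value is required per key")
    scale = math.sqrt(len(query))
    logits = [
        sum(q * k for q, k in zip(query, key)) / scale + weight * p
        for key, p in zip(keys, prior)
    ]
    return softmax(logits)


# Toy example: frames 0 and 1 look identical to the query, but only
# frame 1 contains the character named in the question (prior = 1).
query = [1.0, 0.0]
frame_keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
no_prior = attention_with_prior(query, frame_keys, [0.0, 0.0, 0.0])
with_prior = attention_with_prior(query, frame_keys, [0.0, 1.0, 0.0])
```

Without the prior, the two matching frames receive equal weight; with it, the frame containing the queried character receives more attention.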

The knowledge-based QA module accepts only tex- 
tual inputs (i.e., subtitles, a question, and answer candi- 
dates). It extracts triplet relationships and conversation 
tokens from subtitles. Subsequently, the module exploits 
commonsense-based knowledge for each entity in the 
triplet using a 2-hop graph walk on a commonsense knowl- 
edge graph (Speer, Chin, and Havasi 2017). Finally, it 
encodes these inputs using a text encoder (Devlin et al. 
2019) and infers the correct answer. 
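The 2-hop graph walk can be illustrated with a breadth-first expansion over a small adjacency list. The graph below is a hand-made toy with ConceptNet-style relation names; it is not data from the actual knowledge graph, and `two_hop_facts` is a hypothetical helper rather than the authors' code.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]           # (relation, tail entity)
Fact = Tuple[str, str, str]      # (head entity, relation, tail entity)

# Toy graph for illustration only.
TOY_GRAPH: Dict[str, List[Edge]] = {
    "pregnant": [("RelatedTo", "baby"), ("Causes", "tired")],
    "tired": [("CausesDesire", "sit")],
    "baby": [("RelatedTo", "family")],
    "sit": [("UsedFor", "rest")],
}


def two_hop_facts(graph: Dict[str, List[Edge]], entity: str, max_hops: int = 2) -> List[Fact]:
    """Collect all facts reachable from `entity` within `max_hops` edges."""
    facts: List[Fact] = []
    visited = {entity}
    frontier = [entity]
    for _ in range(max_hops):
        next_frontier: List[str] = []
        for head in frontier:
            for relation, tail in graph.get(head, []):
                facts.append((head, relation, tail))
                if tail not in visited:
                    visited.add(tail)
                    next_frontier.append(tail)
        frontier = next_frontier
    return facts


facts = two_hop_facts(TOY_GRAPH, "pregnant")
```

Starting from "pregnant", the walk gathers the two direct facts and the facts one step beyond them, but stops before the third-hop edge from "sit" to "rest". The collected triples would then be serialized as text and passed to the text encoder together with the subtitles, question, and answer candidates.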


Open-ended VideoQA 


For open-ended VideoQA, we developed three modules: a
metadata-based reasoning QA module, an adaptive encoder-based
QA module, and a knowledge-based QA module. All open-ended QA
modules utilize the same pretrained language decoder (Radford
et al. 2019) to generate sentences, but differ in their
encoder architectures.

The metadata-based reasoning QA module utilizes var- 
ious video metadata (i.e., characters, their behavior, and 
emotions) to encode video contexts (Lee et al. 2021). 
Although a variety of metadata is available for video con- 
tent, existing studies have not considered their utilization. 
This module uses video metadata to generate grounded 
answers. 

The adaptive encoder-based QA module uses the differ- 
ence between correct and wrong answers as a pretext loss 
to adapt the language encoder (Yu et al. 2021). It reduces 
the semantic gap between the visual encoder and the 
language decoder that are trained on different domains. 

The knowledge-based QA module utilizes the same 
encoder architecture as the knowledge-based QA module 
used for multiple-choice QA. Although it does not utilize 
the visual modality, it generates an answer based on scripts 
and other commonsense contexts.


Answer selection 


The answer selection module accepts video frames as
visual inputs and a question, subtitles, and the answers
output by the QA modules as textual inputs. It encodes
video inputs using a visual backbone (He et al. 2016) and
a recurrent memory network, and textual inputs using a
language encoder (Liu et al. 2019). After encoding each
modality, the visual and textual token sequences are
concatenated into a single sequence and passed to a
transformer encoder to produce a final feature. Finally, the
answer selection module selects the most appropriate
answer for the given question using the final feature.


DramaQA dataset 


We constructed the DramaQA dataset (Choi et al. 2021) to 
train the modules implemented within VIP. DramaQA is a 
video story QA dataset based on the drama “Another Miss 
Oh.” It contains 26,373 QA pairs corresponding to 16,402 
video clips of various lengths. Episodes 1-12 were used 
for training, and episodes 13-15 were used for validation. 
Episodes 16-18 were used for testing in VTT. Four diffi- 
culty levels were instituted for questions in the DramaQA 
dataset: simple recall on a single cue for level 1, simple 
recall on multiple cues for level 2, the recognition of sit- 
uational alterations for level 3, and reasoning to determine 
causality for level 4. 
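The episode-based split and the four difficulty levels can be written down compactly. The function and constant names below are hypothetical; the episode ranges and level descriptions are taken from the paragraph above.

```python
# Difficulty levels as described for the DramaQA dataset.
DIFFICULTY_LEVELS = {
    1: "simple recall on a single cue",
    2: "simple recall on multiple cues",
    3: "recognition of situational alterations",
    4: "reasoning to determine causality",
}


def split_for_episode(episode: int) -> str:
    """Map an episode number (1-18) to its dataset split."""
    if 1 <= episode <= 12:
        return "train"
    if 13 <= episode <= 15:
        return "validation"
    if 16 <= episode <= 18:
        return "test"
    raise ValueError(f"episode {episode} is outside the range 1-18")


splits = {episode: split_for_episode(episode) for episode in range(1, 19)}
```

Splitting by episode rather than by clip keeps every scene of the test episodes unseen during training and validation.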


ANALYSIS 
QA and juror decisions 


Three QA instances from the implemented VTT are pre- 
sented in Table 2. In each instance, a round, a question, a 
difficulty level, and the story elements associated with the 
question are described. In addition, the answers provided 
by players in different developmental stages are listed. 
Vincent and human participants answered the given ques- 
tion using diverse sentences. The generated answers were 
sometimes partially or completely incorrect. Each round 
involved three QAs, after which the jurors voted for the 
player that they believed to be the AI agent. The voting 
results in each round are listed in Table 3. Vincent received
the most votes in R1 (Emotion) and R5 (Intention).
However, in R2 (Commonsense), R3 (Background),
and R4 (Conversation), one of the human players received 
the highest number of votes. 
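The pass/fail outcome of each round can be reproduced from the vote totals in Table 3. The sketch below assumes that Vincent fails a round exactly when it alone receives the most votes; how a tie involving Vincent would be handled is not stated in the source, and no such tie occurs in these data. The names `VOTES` and `round_result` are hypothetical.

```python
from typing import Dict

# Total votes per player in each round, copied from Table 3.
VOTES: Dict[str, Dict[str, int]] = {
    "R1: Emotion":      {"Vincent": 10, "6 years": 8,  "9 years": 3,  "12 years": 8, "18 years": 5},
    "R2: Commonsense":  {"Vincent": 1,  "6 years": 10, "9 years": 9,  "12 years": 6, "18 years": 7},
    "R3: Background":   {"Vincent": 7,  "6 years": 3,  "9 years": 14, "12 years": 4, "18 years": 2},
    "R4: Conversation": {"Vincent": 5,  "6 years": 7,  "9 years": 8,  "12 years": 8, "18 years": 6},
    "R5: Intention":    {"Vincent": 11, "6 years": 5,  "9 years": 6,  "12 years": 4, "18 years": 5},
}


def round_result(votes: Dict[str, int], ai_player: str = "Vincent") -> str:
    """Return "Fail" if the AI player alone has the most votes, else "Pass"."""
    top = max(votes.values())
    suspects = [player for player, count in votes.items() if count == top]
    return "Fail" if suspects == [ai_player] else "Pass"


results = {name: round_result(votes) for name, votes in VOTES.items()}
passed_rounds = sum(1 for outcome in results.values() if outcome == "Pass")
```

Under this rule, Vincent passes R2, R3, and R4 and fails R1 and R5, matching the Result column of Table 3.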

We now analyze each round in turn. In R1 (Emotion),
most human players correctly perceived characters’
emotions based on their speech and behavior in video scenes.
However, Vincent sometimes failed to recognize the correct
emotions, especially when they were expressed metaphorically.
Consequently, Vincent received the highest number of votes
in the first round.

On the other hand, Vincent received the fewest votes
(just one vote) in R2 (Commonsense), which concerned
physical or social commonsense knowledge. Compared
to the human players, Vincent generated more logical
answers to the given questions. However, an AI expert
opined that Vincent’s answer “Because the boy thought that
the woman who was talking to him was pregnant” was too
logical and contained too much information. The QA instance
in the second round is listed as the first example in
Table 2.

R3 (Background) concerned background information
not essential to understanding video content, and very
short video clips (between 2 and 6 s in length) were used.
As mentioned previously, the jurors’ expectation that Vincent
would perform well in this round was corroborated:
Vincent exhibited the highest accuracy with decent scene
recognition ability. However, the jurors struggled to identify
Vincent in this round as well. Note that all players and
jurors were asked to watch each video clip first, and then
answer a question related to it. Thus, the video clip was not
replayed after the question was presented. The 9-year-old
human player received the highest number of votes in the
third round. As depicted in the second example in Table 2,
the 9-year-old player correctly recognized the object passing
by the window to be a blue bus. The specificity of her
answer drew suspicion from the jurors.

R4 (Conversation) concerned conversational contexts
between characters based on their relationships. The
corresponding questions were very challenging, requiring
a comprehensive understanding of the video. Both Vincent
and the human players provided diverse incorrect answers
in this round. As a result, the votes were divided, and the
9- and 12-year-old human players received the most votes.

R5 (Intention) concerned the identification of intentions
behind actions. In general, humans rely on their
direct and indirect experiences to adjudge such intentions.
Thus, children and AI agents, both of whom lack
the necessary experiences, were expected to struggle to
answer the questions correctly. Both Vincent and the
6-year-old human player exhibited similar low accuracies.
However, Vincent generated more abstract and ambiguous
responses, and, consequently, received the most votes
(third example in Table 2).

The voting results also revealed that AI-expert jurors
distinguished Vincent from the human players with greater
success than the non-AI-expert jurors.


TABLE 2 Open-ended question and answer examples sampled from VTT. The score of each answer, shown in
parentheses, was obtained by measuring its correctness, specificity, and sensibility in context based on the
analytic rubrics.

Round: R2 (Commonsense)
Question: Why did the boy stand up?
Difficulty level: 4
Story elements: Character, Event, Commonsense/Motivation/Reasoning
  AI: Vincent   Because the boy thought that the woman who was talking to him was pregnant. (1.0)
  6 years       Because Sukyung wants to sit on the seat. (0.57)
  9 years       Sukyung lies that Sukyung was pregnant. (0.43)
  12 years      Because Sukyung was pregnant. (1.0)
  18 years      Because Sukyung told him that she was pregnant. (1.0)

Round: R3 (Background)
Question: What is passing by outside the window?
Difficulty level: 1
Story elements: Object, Place/Identity/Recall
  AI: Vincent   A car is passing by outside the window. (0.6)
  6 years       The car is passing by outside the window. (0.6)
  9 years       Blue bus is passing by outside the window. (1.0)
  12 years      A car passed outside the window. (0.6)
  18 years      The bus passed outside the window. (1.0)

Round: R5 (Action Intention)
Question: What did Haeyoung show to Dokyung?
Difficulty level: 3
Story elements: Character, Object, Behavior/Identity/Recognition
  AI: Vincent   Haeyoung showed the happy memories to Dokyung. (0.67)
  6 years       Haeyoung showed her ring to Dokyung. (0.5)
  9 years       Haeyoung showed accessories. (0.67)
  12 years      Haeyoung showed Dokyung something. (0.33)
  18 years      Haeyoung showed her bracelet. (1.0)


TABLE 3 Voting results of jurors in VTT. The player(s) with the highest number of votes in each round are
marked with an asterisk (*). The jurors detected Vincent in the first and fifth rounds. The two numbers in
parentheses represent the number of votes from AI-expert and non-AI-expert jurors, respectively. Within each
group, the player with the highest count in a round is the one that group guessed to be the AI agent.

Round              AI: Vincent   6 years      9 years      12 years     18 years     Result
R1: Emotion        10 (5/5)*     8 (3/5)      3 (0/3)      8 (4/4)      5 (4/1)      Fail
R2: Commonsense    1 (1/0)       10 (0/10)*   9 (8/1)      6 (3/3)      7 (3/4)      Pass
R3: Background     7 (3/4)       3 (1/2)      14 (6/8)*    4 (3/1)      2 (2/0)      Pass
R4: Conversation   5 (5/0)       7 (3/4)      8 (4/4)*     8 (1/7)*     6 (3/3)      Pass
R5: Intention      11 (6/5)*     5 (3/2)      6 (1/5)      4 (2/2)      5 (4/1)      Fail


QA accuracy with respect to difficulty 
levels 


In this section, we compare the VideoQA accuracies of 
Vincent and the human players to evaluate their video 
comprehension ability in detail. We especially focus on 
analyzing QA accuracy in terms of question difficulty. The 
VideoQA accuracies with respect to the four question dif- 
ficulty levels and two question types (i.e., multiple-choice 
and open-ended questions) are listed in Table 4. Note that 
the presented experimental results are instances selected 
from the implemented VTT, and may not be generalizable 
owing to the small samples of questions and representative 
age groups. Nevertheless, several notable observations are 
presented below. 

First, the average accuracy on both multiple-choice and
open-ended VideoQA increased gradually with human
developmental stage. This is attributed to the maturation
of video comprehension ability with age. Second,
accuracy on open-ended questions was higher than
that on the multiple-choice questions for both Vincent and 
human players. This is attributed to the binary evaluation 
of multiple-choice questions (i.e., 0 or 1) compared to that 
based on the sum of partial scores corresponding to story 
elements or essential terms and concepts of open-ended 
questions. The latter metric was defined based on analytic 
rubrics by the evaluators specializing in cognitive science. 


Third, even though the scales of accuracies on multiple-choice
and open-ended QAs were different, Vincent performed
comparably to the 6-year-old human player on both
types (i.e., 100% achievement on multiple-choice QA and
110.4% achievement on open-ended QA).

Finally, the observed video comprehension ability in
terms of human developmental stages followed the trend
reported in previous works. For instance, children struggle
to distinguish plot-relevant content from less essential
information during video comprehension (Calvert et al.
1982; Singer and Singer 1983). In addition, children usually
emphasize the actions of characters in recall-based
tasks, whereas adults focus on characters’ goals and events
that initiate these goals (Van den Broek, Lorch, and Thurlow
1996). In our experiments, younger participants (i.e.,
6- and 9-year-old players) achieved somewhat lower scores
on questions of difficulty levels 3 and 4 as they required
reasoning regarding temporal relationships and causality
between events. Further, the youngest human player
achieved notably low scores on the questions of difficulty
level 2. We conjecture that younger children struggle
to perform rapid and complex reasoning operations that
require combining multiple supporting facts appearing in
the short video clip.


TABLE 4 Accuracies (%) of players from different developmental stages on multiple-choice and open-ended
VideoQA. Diff. is an abbreviation for question difficulty. The average is taken over all questions. Achievement
is Vincent’s average accuracy expressed as a percentage of the corresponding player’s average accuracy.

Multiple-choice VideoQA   Diff. 1   Diff. 2   Diff. 3   Diff. 4   Average   Achievement
AI: Vincent                  33.3      66.7      50.0      66.7      53.3             -
6 years                      66.7       0.0      66.7      66.7      53.3         100.0
9 years                      66.7     100.0      66.7      66.7      73.3          72.7
12 years                     66.7      66.7     100.0      66.7      80.0          66.7
18 years                     66.7     100.0     100.0      66.7      86.7          61.5

Open-ended VideoQA        Diff. 1   Diff. 2   Diff. 3   Diff. 4   Average   Achievement
AI: Vincent                  70.0      90.0      83.3      74.3      78.5             -
6 years                      90.0      70.7      75.0      54.8      71.1         110.4
9 years                      80.0      85.7      83.3      71.4      79.1          99.3
12 years                     70.0      82.1      66.7      94.3      80.9          97.1
18 years                     80.0     100.0     100.0     100.0      94.7          83.0

Total                     Diff. 1   Diff. 2   Diff. 3   Diff. 4   Average   Achievement
AI: Vincent                  54.3      80.0      58.3      71.4      65.9             -
6 years                      80.0      40.4      68.8      59.2      62.2         106.0
9 years                      74.3      91.8      70.9      69.6      76.2          86.5
12 years                     68.6      75.5      91.7      83.7      80.4          82.0
18 years                     74.3     100.0     100.0      87.5      90.4          72.7


Analysis using cognitive modules


To analyze each player’s video comprehension ability, we
evaluated their performance in terms of literary elements
within the story based on CogME (Shin et al. 2021). We
calculated the correct predictions corresponding to each
story element related to the questions. The performance
profiles of the human players and Vincent with respect to
three cognitive modules (i.e., Target, Content, and Thinking)
are depicted in Figure 4. The score corresponding to
each story element was normalized using the corresponding
scores of the 18-year-old human player to compare
the achievement of Vincent on each story element with
those of the human players. The average profile of the four
human players is indicated by a blue line, with the range
between the minimum and maximum accuracy of the
players indicated by the light blue region. The profile of
Vincent is represented by an orange line.

The performance of Vincent corresponding to most 
story elements of the cognitive modules was observed to 
lie within the performance variance range of the human 
players. In particular, Vincent exhibited a fine perception 
ability for objects (i.e., Object in the Target module), and 
characteristics of characters and objects (i.e., Feature in the 
Content module). On the other hand, Vincent exhibited 
a limited capability for several story elements (e.g., Place 
in the Target module, Context and Causality in the Con- 
tent module, and Recall in the Thinking module). Fewer 
than five questions were associated with the aforemen- 
tioned categories, inducing a strong penalty when Vincent 
failed to infer the correct answer to a question within the 
categories. We would like to emphasize once again that 
the experimental results are selected instances from the 
implemented VTT, which may not be generalizable. 
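The profile comparison used in this analysis can be sketched as follows: each player's per-element score is divided by the 18-year-old player's score, and the human minimum, mean, and maximum are then computed per story element. All numbers below are made-up illustrative values, not results from the VTT, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Profile = Dict[str, float]  # story element -> score


def normalize_profile(profile: Profile, reference: Profile) -> Profile:
    """Divide each story-element score by the reference player's score."""
    return {
        element: score / reference[element]
        for element, score in profile.items()
        if reference.get(element)  # skip elements with a missing or zero reference
    }


def human_band(profiles: List[Profile]) -> Dict[str, Tuple[float, float, float]]:
    """Return (min, mean, max) per story element across normalized profiles."""
    elements = set.intersection(*(set(p) for p in profiles))
    band = {}
    for element in sorted(elements):
        values = [p[element] for p in profiles]
        band[element] = (min(values), sum(values) / len(values), max(values))
    return band


# Hypothetical raw scores for two story elements.
humans = {
    "6 years":  {"Object": 0.4, "Place": 0.5},
    "9 years":  {"Object": 0.6, "Place": 0.75},
    "12 years": {"Object": 0.8, "Place": 1.0},
    "18 years": {"Object": 0.8, "Place": 1.0},
}
vincent = {"Object": 0.8, "Place": 0.25}

reference = humans["18 years"]
band = human_band([normalize_profile(p, reference) for p in humans.values()])
vincent_normalized = normalize_profile(vincent, reference)
inside = {
    element: band[element][0] <= value <= band[element][2]
    for element, value in vincent_normalized.items()
}
```

In this toy setting, the AI profile lies inside the human band for Object and below it for Place, which is the kind of pattern the figure is meant to expose. With only a handful of questions per story element, a single wrong answer shifts the normalized score substantially.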


[Figure 4: profile plot over story elements. Legible axis labels include Place, Relation, Character, Context, Emotion, Causality, and Motivation.]

FIGURE 4 Performance profiles of humans and Vincent (AI) in the Video Turing Test (VTT). The blue and orange lines represent the
ratio of correct predictions for each story element by the humans (average of four human players) and Vincent, respectively. The variances
between the minimum and maximum accuracies of the human players are indicated by the light blue region.


DISCUSSION AND CONCLUSIONS 
Significance of VTT 


We designed VTT to evaluate the video comprehension 
level of AI agents. To this end, we presented evaluation 
methods to compensate for the limitations of the Turing 
Test. We verified their validity by applying them on a trial 
basis. 

In terms of VTT composition, we (1) compared the 
answers of human players in various age groups and 
an AI agent to enable evaluation based on cognitive 
development by age, and (2) organized and evaluated 
the rounds by subject considering the ability required to 
answer the questions. While composing the questions, we 
(3) assigned difficulty levels by considering the required 
memory capacity and reasoning complexity. Moreover, we 
(4) examined the advantages and disadvantages of
multiple-choice and open-ended questions. In terms of
evaluation criteria, we (5) proposed the number of votes, 
overall accuracy, achievement for each age group, accu- 
racy with respect to the question difficulty, and CogME 
analysis for refined evaluation of video comprehension 
intelligence. 

Vincent passed in three out of five rounds in VTT, 
exhibiting accuracy similar to that of a 6-year-old partici- 
pant. Furthermore, the analysis results that were obtained 
using the proposed evaluation measures were discussed. 
However, the reliability of the results may not be high 
owing to insufficient numbers of participants, questions, 
and voters. Therefore, the results of VTT should be used 
as guidelines for evaluating the advantages and disad- 
vantages of current AI and identifying issues that merit 
investigation, rather than as results of rigorous scientific 
analysis. 


The implemented VTT is a meaningful first step in the 
development of video-understanding intelligence, which 
demonstrates the possibility of developing human-level 
AI with complex intelligence in faculties such as voice, 
language, and vision, beyond single-intelligence AIs, for
example, those in chess, quiz competitions, and Go. This, 
in turn, is expected to find applications in various indus- 
trial fields, such as communication at home, care for 
the elderly, and audiovisual interactive education in the 
metaverse. 


Future research directions 


This study employed single-turn QA in the VTT to exam- 
ine video understanding capabilities. It should be noted 
that VTT is not limited in format and can be extended 
to various related video comprehension tasks, such as 
video story captioning, story-related dialog generation, and 
future story prediction. Here, we propose several future
research directions to further explore and develop the VTT
paradigm.


Complementary multimodal integration: Recent 
video-understanding studies have attempted to 
develop multimodal integration models, extending 
from “video + text” (Zellers et al. 2021) to “video + 
text + audio” (Zellers et al. 2022), to learn video rep- 
resentations. However, their development remains 
in the early stages. Extensive research is required to 
implement cross-modality and utilize complemen- 
tarity in a manner comparable to how a visually 
impaired person grasps information via touch and 
hearing. 

Longer videos: Most VideoQA research has evaluated
video comprehension ability on shot- and
scene-length videos. Models that can handle longer
contexts should be developed in the future (Wu and
Krahenbuhl 2021; Soldan et al. 2022).

Multiturn dialog: In this study, we considered 
single-turn QA. In the future, models that are capa- 
ble of multiturn QA or dialog should be proposed 
to enable mutual inquiry regarding questions in the 
form of a dialog (Lee, Heo, and Zhang 2018; Alamri 
et al. 2019). The integration of VideoQA and learn- 
ing based on inquiry also merits investigation to 
enable AI agents to expand their knowledge via 
interaction in the case of uncertainty or the need 
for additional knowledge. 

Multitask learning for video comprehension: 
In this study, we developed and evaluated video- 
understanding intelligence with a focus on 
VideoQA. Video-understanding AI can also be 
advanced based on multitask learning, such as 
video captioning and video retrieval. 

Continual video comprehension: The development 
of a model that is capable of online, incremental 
learning while minimizing catastrophic forgetting 
for a variety of video data in the real world will aid 
the development of video-understanding AI. 

Adversarial learning for VTT: Evaluation using
human jurors is subjective, and verifying the
validity of answers to difficult questions
is complicated. Therefore, AI jurors trained
adversarially against AI agents may be used
in place of human jurors to enable better
classification and the generation of more human-like
answers.

Optimization and real-time systems: Real-time 
performance is crucial for practical applications. 
VIP integrates and implements various modules, 
and therefore, has a complicated structure. Fur- 
thermore, the overall system performance becomes 
poor when errors accumulate in the modules. 
Thus, the design and configuration of the video- 
understanding system should be analyzed to reduce 
the system response latency and process videos in 
real time. 


Beyond VTT 


The following tasks can be considered as subsequent 
milestones in the development of AGI. 

From human-likeness to human augmentation: 
Turing considered whether a machine could imitate a 
human. Since then, the development of AI matching 
human intelligence has been considered a major goal. 


Significant strides have been made in this regard. In
particular, human-like AI technology is essential for direct 
conversation with people and in cases involving emotional 
interaction. However, not all types of AI agents should 
be human-like. In the future, focusing on augmenting 
humans rather than simply mimicking them could enable 
new capabilities and services. 

Embodied Turing Test: Embodied intelligence, that is, 
sensory learning, is an essential learning method in liv- 
ing beings. Thus, the development of AI technology that 
interacts with humans using an embodied agent inter- 
acting directly with the environment through movement, 
manipulation, observation, and conversation will bring 
about a remarkable development in AI. To this end, it is 
necessary to progress from machine learning to a learning
machine, which constructs a world model based on
embedded cognition and a perception-cognition-action
learning cycle. We expect that the next-generation Turing
Test, after VTT, will be an Embodied Turing Test in a
real-world environment.


ACKNOWLEDGMENTS 

We sincerely appreciate our colleagues from the VTT team 
for their valuable collaboration in this project. Their vast 
expertise, insights, and dedication greatly elevated the 
quality of our research. 


CONFLICT OF INTEREST STATEMENT 
The authors declare that there is no conflict. 


ORCID 

Minsu Lee https://orcid.org/0000-0002-9601-3863
Yu-Jung Heo https://orcid.org/0000-0002-5725-9545
Seongho Choi https://orcid.org/0000-0002-7553-6761
Woo Suk Choi https://orcid.org/0009-0001-8091-347X
Byoung-Tak Zhang https://orcid.org/0000-0001-9890-0389


REFERENCES 


Adams, S. S., G. Banavar, and M. Campbell. 2016. “I-athlon: Towards 
a Multidimensional Turing Test.” AI Magazine 37(1): 78-84. 

Adiwardana, D., M.-T. Luong, D. R. So, J. Hall, N. Fiedel, R. 
Thoppilan, Z. Yang, et al. 2020. “Towards a Human-like Open- 
domain Chatbot.” CoRR, abs/2001.09977. 

Alamri, H., V. Cartillier, A. Das, J. Wang, A. Cherian, I. Essa, D. 
Batra, et al. 2019. “Audio Visual Scene-aware Dialog.” In Proceed- 
ings of the IEEE/CVF Conference on Computer Vision and Pattern 
Recognition, 7558-67. 

Banerjee, S., and A. Lavie. 2005. “METEOR: An Automatic Met- 
ric for MT Evaluation with Improved Correlation with Human 
Judgments.” In Proceedings of the ACL Workshop on Intrinsic 
and Extrinsic Evaluation Measures for Machine Translation and/or 
Summarization, 65-72. 




Bebensee, B., and B.-T. Zhang. 2021. “Co-Attentional Transformers
for Story-based Video Understanding.” In IEEE International Con- 
ference on Acoustics, Speech and Signal Processing, ICASSP 2021, 
Toronto, ON, Canada, June 6-11, 2021, 4005-9. IEEE. 

Boden, M. A. 2006. Mind as Machine: A History of Cognitive Science, 
volume 1. New York: Oxford University Press. 

Calvert, S. L., A. C. Huston, B. A. Watkins, and J. C. Wright. 1982. 
“The Relation Between Selective Attention to Television Forms 
and Children’s Comprehension of Content.” Child Development 
53: 601-10. 

Choi, S., K.-W. On, Y.-J. Heo, A. Seo, Y. Jang, M. Lee, and B.-T. Zhang. 
2021. “DramaQA: Character-centered Video Story Understanding 
with Hierarchical QA.” In Proceedings of the AAAI Conference on 
Artificial Intelligence, volume 35, 1166-74. 

Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2019. “BERT:
Pre-training of Deep Bidirectional Transformers for Language 
Understanding.” In Proceedings of the 2019 Conference of the North 
American Chapter of the Association for Computational Linguistics: 
Human Language Technologies, (Long and Short Papers), volume 
1, 4171-86. 

Diba, A., M. Fayyaz, V. Sharma, M. Paluri, J. Gall, R. Stiefelhagen, 
and L. V. Gool. 2020. “Large scale holistic video understanding.” 
In European Conference on Computer Vision, 593-610. Springer. 

DramaQA challenge on ECCV. 2020. “The 2nd DramaQA Chal- 
lenge.” Accessed: August 2020. https://dramaqa.snu.ac.kr/ 
Challenge/2020 

DramaQA challenge on KSC. 2019. “The 1st DramaQA Challenge.” 
Accessed: November 2019. https://dramaqa.snu.ac.kr/Challenge/ 
2019 

Ekman, P. 1989. “The Argument and Evidence About Universals 
in Facial Expressions.” In Handbook of Social Psychophysiology, 
volume 143, 164. Chichester: Wiley. 

Geman, D., S. Geman, N. Hallonquist, and L. Younes. 2015. “Visual 
turing test for computer vision systems.” Proceedings of the 
National Academy of Sciences 112(12): 3618-23. 

He, K., X. Zhang, S. Ren, and J. Sun. 2016. “Deep Resid- 
ual Learning for Image Recognition.” In Proceedings of the 
IEEE/CVF Conference on Computer Vision and Pattern Recognition 
(CVPR). 

Heo, Y.-J., M. S. Lee, S. Choi, W. S. Choi, M. Shin, M. Jung, J.-K. Ryu, 
and B.-T. Zhang. 2021. “Toward a Human-level Video Understand- 
ing Intelligence.” AAAI Fall Symposium on Artificial Intelligence 
for Human-Robot Interaction. 

Heo, Y.-J., K.-W. On, S.-H. Choi, J. Lim, J. Kim, J.-K. Ryu, B.-C. Bae, 
and B.-T. Zhang. 2019. “Constructing Hierarchical Q&A Datasets 
for Video Story Understanding.” AAAI Spring Symposium on 
Story-Enabled Intelligence. 

Hochreiter, S., and J. Schmidhuber. 1997. “Long Short-term Memory.”
Neural Computation 9(8): 1735-80. 

Jain, N., S. Kumar, A. Kumar, P. Shamsolmoali, and M. Zareapoor.
2018. “Hybrid Deep Neural Networks for Face Emotion Recogni- 
tion.” Pattern Recognition Letters 115: 101-6. 

Jarrold, W., and P. Z. Yeh. 2016. “The Social-emotional Turing
Challenge.” AI Magazine 37(1): 31-38. 

Kim, J., M. Ma, K. Kim, S. Kim, and C. D. Yoo. 2019. “Progressive 
Attention Memory Network for Movie Story Question Answer- 
ing.” In Proceedings of the IEEE/CVF Conference on Computer 
Vision and Pattern Recognition (CVPR). 


Kim, J., M. Ma, T. Pham, K. Kim, and C. D. Yoo. 2020. “Modal- 
ity Shifting Attention Network for Multi-modal Video Question 
Answering.” In IEEE/CVF Conference on Computer Vision and 
Pattern Recognition (CVPR). 

Kim, J., S. Yoon, D. Kim, and C. D. Yoo. 2021. “Structured Co- 
reference Graph Attention for Video-grounded Dialogue.” Pro- 
ceedings of the AAAI Conference on Artificial Intelligence 35(2): 
1789-97. 

Kim, K.-M., S.-H. Choi, J.-H. Kim, and B.-T. Zhang. 2018. “Multi- 
modal Dual Attention Memory for Video Story Question Answer- 
ing.” In Proceedings of the European Conference on Computer 
Vision (ECCV). 

Lee, D., S. Choi, Y. Jang, and B.-T. Zhang. 2021. “Mounting Video 
Metadata on Transformer-based Language Model for Open-ended 
Video Question Answering.” CoRR, abs/2108.05158. 

Lee, S.-W., Y.-J. Heo, and B.-T. Zhang. 2018. “Answerer in
Questioner’s Mind: Information Theoretic Approach to Goal-oriented
Visual Dialog.” In Advances in Neural Information Processing
Systems (NeurIPS), volume 31.

Lei, J., L. Yu, M. Bansal, and T. L. Berg. 2018. “TVQA: Localized,
Compositional Video Question Answering.” In Proceedings
of the 2018 Conference on Empirical Methods in Natural Language
Processing, 1369-79.

Lenat, D. B., and R. V. Guha. 1989. Building Large Knowledge-based
Systems: Representation and Inference in the Cyc Project. Reading:
Addison-Wesley Longman Publishing Co., Inc.

Levesque, H., E. Davis, and L. Morgenstern. 2012. “The Winograd Schema
Challenge.” In Thirteenth International Conference on the Principles
of Knowledge Representation and Reasoning.

Liu, Y., M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, 
L. Zettlemoyer, and V. Stoyanov. 2019. “RoBERTa: A Robustly 
Optimized BERT Pretraining Approach.” CoRR, abs/1907.11692.

McKinstry, C. 1997. “Minimum Intelligent Signal Test: An Objective 
Turing Test.” Canadian Artificial Intelligence 41: 17-18. 

Miech, A., D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J. 
Sivic. 2019. “HowTo100M: Learning a Text-Video Embedding by
Watching Hundred Million Narrated Video Clips.” In Proceedings
of the IEEE/CVF International Conference on Computer Vision
(ICCV), 2630-40.

Na, S., S. Lee, J. Kim, and G. Kim. 2017. “A Read-write Memory Net- 
work for Movie Story Understanding.” In The IEEE International 
Conference on Computer Vision (ICCV). 

Olague, G., M. Olague, A. R. Jacobo-Lopez, and G. Ibarra-Vázquez. 
2021. “Less Is More: Pursuing the Visual Turing Test With the 
Kuleshov Effect.” In Proceedings of the IEEE/CVF Conference on 
Computer Vision and Pattern Recognition (CVPR), 1553-61. 

Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu. 2002. “BLEU: 
A Method for Automatic Evaluation of Machine Translation.” 
In Proceedings of the 40th Annual Meeting of the Association for 
Computational Linguistics, 311-18. 

Pennington, J., R. Socher, and C. D. Manning. 2014. “GloVe: Global 
Vectors for Word Representation.” In Proceedings of the Conference
on Empirical Methods in Natural Language Processing (EMNLP),
volume 14, 1532-43. 

Radford, A., J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. 
2019. “Language Models are Unsupervised Multitask Learners.” 
OpenAI blog 1(8): 9. 




Ryu, H., S. Kang, H. Kang, and C. D. Yoo. 2021. “Semantic Group- 
ing Network for Video Captioning.” Proceedings of the AAAI 
Conference on Artificial Intelligence 35(3): 2514-22. 

Schoenick, C., P. E. Clark, O. Tafjord, P. Turney, and O. Etzioni.
2017. “Moving Beyond the Turing Test with the Allen AI
Science Challenge.” Communications of the ACM 60(9): 60-64.

Shieber, S. M. 1994. “Lessons from a Restricted Turing Test.” 
Communications of the Association for Computing Machinery 37: 
70-78. 

Shin, M., J. Kim, S.-H. Choi, Y.-J. Heo, D. Kim, M. Lee, B.-T. Zhang, 
and J.-K. Ryu. 2021. “CogME: A Novel Evaluation Metric for Video 
Understanding Intelligence.” CoRR, abs/2107.09847. 

Singer, J. L., and D. G. Singer. 1983. “Implications of Childhood
Television Viewing for Cognition, Imagination, and Emotion.”
In Children’s Understanding of Television: Research on
Attention and Comprehension, 265-95. New York: Academic
Press.

Soldan, M., A. Pardo, J. L. Alcázar, F. Caba Heilbron, C. Zhao, S.
Giancola, and B. Ghanem. 2022. “MAD: A Scalable Dataset for 
Language Grounding in Videos from Movie Audio Descriptions.” 
In Proceedings of the IEEE/CVF Conference on Computer Vision 
and Pattern Recognition (CVPR), 5026-35. 

Speer, R., J. Chin, and C. Havasi. 2017. “ConceptNet 5.5: An Open 
Multilingual Graph of General Knowledge.” In Proceedings of the 
Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17,
4444-51. AAAI Press. 

Tapaswi, M., Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. 
Fidler. 2016. “MovieQA: Understanding Stories in Movies through 
Question-answering.” In Proceedings of the IEEE/CVF Conference 
on Computer Vision and Pattern Recognition, 4631-40. 

Turing, A. 1950. “Computing Machinery and Intelligence.” Mind 
59(236): 433-60. 

Van den Broek, P., E. P. Lorch, and R. Thurlow. 1996. “Children’s and 
Adults’ Memory for Television Stories: The Role of Causal Fac- 
tors, Story-grammar Categories, and Hierarchical Level.” Child 
Development 67(6): 3010-28. 

Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. 
Gomez, Ł. Kaiser, and I. Polosukhin. 2017. “Attention is All You 
Need.” In Advances in Neural Information Processing Systems 
(NeurIPS), volume 30. 

Vedantam, R., C. Lawrence Zitnick, and D. Parikh. 2015. “CIDEr:
Consensus-based Image Description Evaluation.” In Proceedings 
of the IEEE Conference on Computer Vision and Pattern Recognition 
(CVPR), 4566-75. 

VTT workshop on ECCV. 2020. “The 2nd Workshop on Video Turing 
Test: Toward Human-level Video Story Understanding.” Accessed: 
August 2020. https://dramaqa.snu.ac.kr/Workshop/2020 

VTT workshop on ICCV. 2019. “The 1st Workshop on Video Turing 
Test: Toward Human-level Video Story Understanding.” Accessed: 
November 2019. https://dramaqa.snu.ac.kr/Workshop/2019 

Wu, C.-Y., and P. Krähenbühl. 2021. “Towards Long-form Video
Understanding.” In Proceedings of the IEEE/CVF Conference on 
Computer Vision and Pattern Recognition, 1884-94. 

Yu, Y., J. Kim, and G. Kim. 2018. “A Joint Sequence Fusion Model for
Video Question Answering and Retrieval.” In Proceedings of the 
European Conference on Computer Vision (ECCV). 

Yu, Y., J. Kim, H. Yun, J. Chung, and G. Kim. 2020. “Character
Grounding and Re-identification in Story of Videos and
Text Descriptions.” In Proceedings of the European Conference on
Computer Vision (ECCV).

Yu, Y., J. Chung, H. Yun, J. Kim, and G. Kim. 2021. “Transitional 
Adaptation of Pretrained Models for Visual Storytelling.” In Pro- 
ceedings of the IEEE/CVF Conference on Computer Vision and 
Pattern Recognition (CVPR), 12658-68. 

Zellers, R., J. Lu, X. Lu, Y. Yu, Y. Zhao, M. Salehi, A. Kusupati, J. 
Hessel, A. Farhadi, and Y. Choi. 2022. “Merlot Reserve: Neural 
Script Knowledge through Vision and Language and Sound.” In 
Proceedings of the IEEE/CVF Conference on Computer Vision and 
Pattern Recognition, 16375-87. 

Zellers, R., X. Lu, J. Hessel, Y. Yu, J. S. Park, J. Cao, A. Farhadi, 
and Y. Choi. 2021. “Merlot: Multimodal Neural Script Knowledge 
Models.” Advances in Neural Information Processing Systems 34: 
23634-51. 


How to cite this article: Lee, M., Y.-J. Heo, S. 
Choi, W. S. Choi, and B.-T. Zhang. 2023. “Video 
Turing Test: A first step towards human-level AI.” 
AI Magazine 44: 537-554. 
https://doi.org/10.1002/aaai.12128 


AUTHOR BIOGRAPHIES 


Minsu Lee is a Research Professor at the Artificial 
Intelligence Institute of Seoul National University. She 
earned her Ph.D. in Computer Science and Engineering 
from Ewha Womans University. Before joining Seoul 
National University, she served as a Research Professor 
at Ewha Womans University and was a Visiting Scholar 
at Indiana University, USA. Her recent research focuses 
on active learning and machine learning techniques 
inspired by human learning mechanisms. Currently, 
she leads projects on goal-oriented self-supervised rein- 
forcement learning and learning by asking questions 
for video understanding. 


Yu-Jung Heo is a Principal Research Engineer at KT 
AI2XL. She received her Ph.D. degree in Computer 
Science and Engineering from Seoul National Univer- 
sity in 2023. Prior to this, she completed her BS in 
Computer Science and Information Engineering and 
IT-based Management from Inha University in 2015. 
Her research focuses on multimodal representation 
learning and knowledge-enhanced reasoning. 


Seongho Choi is a Ph.D. candidate in Computer Sci- 
ence and Engineering at Seoul National University. He 
earned his Bachelor of Science degree in Computer 
Science and Engineering from the same institution. His
research interest focuses on the task of multimodal
video narratives, demanding a deep understanding of 
both visual content and language. His work has been 
published in prominent conferences such as AAAI. 


Woo Suk Choi is a Ph.D. candidate in the interdis- 
ciplinary program in Neuroscience at Seoul National 
University, and received his BS degree in Computer 
Engineering from the University at Buffalo in 2016. He 
has worked as a Research Assistant at the Korea Electronics
Technology Institute (KETI). His research interests
include learning and reasoning at the intersection
of vision and language, such as video understanding
and neuro-symbolic graphs. His research has been published
in ACL and NAACL. He is currently working on
learning by asking questions for video understanding. 


Byoung-Tak Zhang is the POSCO Chair Professor of 
Computer Science and Engineering at Seoul National 
University (SNU) and the Founding Director of the 
SNU Artificial Intelligence Institute. He earned his 
Ph.D. from the University of Bonn, Germany, and his 
BS and MS from SNU. Before joining SNU in 1997, 
he was a Research Fellow at the German National 
Research Center for Computer Science (GMD). His vis- 
iting professorships include MIT CSAIL and Brain and 
Cognitive Science Department, Samsung Advanced 
Institute of Technology, Cognitive Technical Systems 
Excellence Center (CoTeSys) in Munich, Cognitive 
Interaction Technology Excellence Center (CITEC) in 
Bielefeld, and Princeton Neuroscience Institute (PNI). 
His recent research focuses on brain-inspired archi- 
tectures for the embodied cognitive agents that learn 
fully autonomously with sensors and actuators in 
multisensory interaction with the real world. 




